Introduction

This case study analyzes 2025 Cyclistic bike-share data to answer the question:

How do annual members and casual riders use Cyclistic bikes differently?

Using R and the tidyverse, I cleaned, processed, and analyzed the dataset to explore patterns in ride frequency, time of year, day of week, time of day, ride length, and bike type. The goal is to identify meaningful differences in behavior between rider types and provide actionable insights for marketing and operational strategies to convert single-ride and day-pass (“casual”) riders into annual “members”.


1. Load Packages

Install and load relevant packages.

# install.packages("tidyverse")
# install.packages("knitr")
# install.packages("kableExtra")
library(tidyverse)
library(knitr)
library(kableExtra)

2. Import and Combine Data

Get a list of all CSV files in the data folder.

# pattern is a regular expression (not a glob), so anchor it to names ending in ".csv"
csv_file_paths <- list.files(path = "data/2025-divvy-tripdata", pattern = "\\.csv$", full.names = TRUE)
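One detail worth keeping in mind: the pattern argument of list.files() is a regular expression, not a shell-style glob. The short base R sketch below shows how an anchored pattern behaves; the file names are made up purely for illustration.

```r
# Hypothetical file names, used only to illustrate the matching rule
files <- c(
  "202501-divvy-tripdata.csv",
  "202501-divvy-tripdata.csv.bak",
  "README.md"
)

# "\\.csv$" means: a literal dot, then "csv", at the very end of the name
is_csv <- grepl("\\.csv$", files)
csv_files <- files[is_csv]
print(csv_files)
```

Anchoring the pattern with `$` keeps backup or temporary files that merely contain ".csv" out of the import.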


Read all CSV files and combine them into a single tibble.

tripdata_2025_combined <- map_dfr(csv_file_paths, read_csv)


Let’s take a look at the raw combined data.

glimpse(tripdata_2025_combined)
## Rows: 5,552,994
## Columns: 13
## $ ride_id            <chr> "7569BC890583FCD7", "013609308856B7FC", "EACACD3CE0…
## $ rideable_type      <chr> "classic_bike", "electric_bike", "classic_bike", "c…
## $ started_at         <dttm> 2025-01-21 17:23:54, 2025-01-11 15:44:06, 2025-01-…
## $ ended_at           <dttm> 2025-01-21 17:37:52, 2025-01-11 15:49:11, 2025-01-…
## $ start_station_name <chr> "Wacker Dr & Washington St", "Halsted St & Wrightwo…
## $ start_station_id   <chr> "KA1503000072", "TA1309000061", "13235", "13235", "…
## $ end_station_name   <chr> "McClurg Ct & Ohio St", "Racine Ave & Belmont Ave",…
## $ end_station_id     <chr> "TA1306000029", "TA1308000019", "13278", "13071", "…
## $ start_lat          <dbl> 41.88314, 41.92915, 41.94823, 41.94823, 41.94823, 4…
## $ start_lng          <dbl> -87.63724, -87.64915, -87.66407, -87.66407, -87.664…
## $ end_lat            <dbl> 41.89259, 41.93974, 41.94553, 41.94374, 41.94374, 4…
## $ end_lng            <dbl> -87.61729, -87.65887, -87.64644, -87.66402, -87.664…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…

We can confirm here that started_at and ended_at are in date/time format.


We’ll start by creating a cleaned working dataset.
We want to preserve the original raw dataset and perform all transformations on a cleaned copy.

tripdata_2025_cleaned <- tripdata_2025_combined

3. Data Cleaning

Let’s start by checking if any columns have missing data.

tibble(
  column = names(tripdata_2025_cleaned),
  missing_count = colSums(is.na(tripdata_2025_cleaned))
  ) %>%
  mutate(missing_count = format(missing_count, big.mark = ",")) %>%
  kable(
    col.names = c("Column to Check", "Missing Data Count")
    ) %>% 
  kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("striped", "bordered"))
Column to Check       Missing Data Count
ride_id               0
rideable_type         0
started_at            0
ended_at              0
start_station_name    1,184,673
start_station_id      1,184,673
end_station_name      1,243,305
end_station_id        1,243,305
start_lat             0
start_lng             0
end_lat               5,535
end_lng               5,535
member_casual         0

Many station names and IDs are missing (roughly 1.2 million each for start and end stations), but since these columns aren’t central to our analyses, we can safely ignore them. For a more detailed analysis, we might be able to impute some of these missing values by matching latitude and longitude values.
Out of the important columns for our analyses, it looks like only end_lat and end_lng have missing data (5,535 each).


As a quick follow-up, we can see if any of our rows with missing end_lat or end_lng values have end_station_name or end_station_id values we could try to match up to impute the values.

tripdata_2025_cleaned %>%
  filter(is.na(end_lat) | is.na(end_lng)) %>%
  summarise(
    missing_station_name = sum(is.na(end_station_name)),
    missing_station_id = sum(is.na(end_station_id))
  ) %>%
  mutate(
    missing_station_name = format(missing_station_name, big.mark = ","),
    missing_station_id = format(missing_station_id, big.mark = ",")
  ) %>%
  kable(
    col.names = c("Missing End Station Name", "Missing End Station ID")
  ) %>% 
  kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("bordered"))
Missing End Station Name    Missing End Station ID
5,535                       5,535

All rows with missing end_lat or end_lng values are also missing the end station name and ID, so we don’t have a reliable way to impute the coordinates. With only 5,535 affected rows out of more than 5.5 million, the impact on our analyses should be negligible.
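To put that number in context, here is a quick check of the share of affected rows, using the row counts reported above.

```r
total_rows <- 5552994         # rows in the combined dataset
missing_end_coords <- 5535    # rows with missing end_lat / end_lng

# Share of rows with missing end coordinates, as a percentage
missing_pct <- 100 * missing_end_coords / total_rows
print(round(missing_pct, 2))
```

That works out to roughly a tenth of one percent of all trips.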


Since our analysis depends on comparing rider types, we should verify that the member_casual column only contains either member or casual.

tripdata_2025_cleaned %>%
  count(member_casual) %>%
  mutate(n = format(n, big.mark = ",")) %>%
  kable(
    col.names = c("Rider Type", "Count")
    ) %>% 
  kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("bordered"))
Rider Type    Count
casual        1,999,497
member        3,553,497

This looks correct, as we only have two unique values in the member_casual column.
We can also see that there are about 78% more member rides than casual rides.
Since there are no missing or incorrect values in member_casual, we know all trips can be categorized by rider type and we don’t need to filter anything out yet.
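The "about 78% more" figure comes straight from the two counts in the table; a minimal check:

```r
casual_rides <- 1999497
member_rides <- 3553497

# How many more member rides there are, relative to casual rides
pct_more <- 100 * (member_rides - casual_rides) / casual_rides
print(round(pct_more, 1))
```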


Now we’ll check if any ended_at date/times are the same as the started_at date/times.
A zero-length trip could still be valid if the start and end stations are the same, for example because the user canceled the trip or a technical issue occurred.

sum(tripdata_2025_cleaned$ended_at == tripdata_2025_cleaned$started_at, na.rm = TRUE)
## [1] 0

No started_at and ended_at times are equal.


Now let’s check if any ended_at date/times occur before the started_at date/times.

sum(tripdata_2025_cleaned$ended_at < tripdata_2025_cleaned$started_at, na.rm = TRUE)
## [1] 29

We do have an issue here with 29 ended_at date/times that come before the started_at.


We’ll remove the rows where ended_at is before started_at, since these represent invalid trip times.
With the amount of data we have, removing them should have a minimal effect on our dataset.

tripdata_2025_cleaned <- tripdata_2025_cleaned %>%
  filter(ended_at > started_at)
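As a small illustration of what this filter keeps, here is a base R sketch on three made-up trips (the ride IDs and timestamps are hypothetical). Note that the strict `>` comparison would also drop zero-length trips; we found none of those above, so in our data only the 29 reversed rows are removed.

```r
# Three hypothetical trips: one valid, one ending before it starts, one zero-length
trips <- data.frame(
  ride_id = c("A", "B", "C"),
  started_at = as.POSIXct(
    c("2025-01-21 17:23:54", "2025-01-21 18:00:00", "2025-01-21 19:00:00"),
    tz = "UTC"
  ),
  ended_at = as.POSIXct(
    c("2025-01-21 17:37:52", "2025-01-21 17:55:00", "2025-01-21 19:00:00"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Keep only trips that end strictly after they start
valid_trips <- trips[trips$ended_at > trips$started_at, ]
print(valid_trips$ride_id)
```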


We’ll add three new columns: ride_length is the trip duration in minutes, and month and day_of_week are ordered factors showing the full label.

tripdata_2025_cleaned <- tripdata_2025_cleaned %>%
  mutate(
    ride_length = as.numeric(difftime(ended_at, started_at, units="mins")),
    month = month(started_at, label = TRUE, abbr = FALSE),
    day_of_week = wday(started_at, label = TRUE, abbr = FALSE)
    )
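To sanity-check the duration calculation, here is the same difftime() call applied to a single pair of timestamps, taken from the first row as displayed by glimpse(). The displayed values are truncated to whole seconds, so the result can differ slightly from the ride_length stored in the dataset, which presumably reflects sub-second precision in the raw timestamps (an assumption on my part).

```r
started_at <- as.POSIXct("2025-01-21 17:23:54", tz = "UTC")
ended_at   <- as.POSIXct("2025-01-21 17:37:52", tz = "UTC")

# Duration in minutes, converted from a difftime object to a plain number
ride_length <- as.numeric(difftime(ended_at, started_at, units = "mins"))
print(ride_length)
```

Setting units explicitly matters here: without it, difftime() picks a unit automatically, so short and long trips could end up on different scales.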


Let’s take a look to make sure the new columns were added correctly.

glimpse(tripdata_2025_cleaned)
## Rows: 5,552,965
## Columns: 16
## $ ride_id            <chr> "7569BC890583FCD7", "013609308856B7FC", "EACACD3CE0…
## $ rideable_type      <chr> "classic_bike", "electric_bike", "classic_bike", "c…
## $ started_at         <dttm> 2025-01-21 17:23:54, 2025-01-11 15:44:06, 2025-01-…
## $ ended_at           <dttm> 2025-01-21 17:37:52, 2025-01-11 15:49:11, 2025-01-…
## $ start_station_name <chr> "Wacker Dr & Washington St", "Halsted St & Wrightwo…
## $ start_station_id   <chr> "KA1503000072", "TA1309000061", "13235", "13235", "…
## $ end_station_name   <chr> "McClurg Ct & Ohio St", "Racine Ave & Belmont Ave",…
## $ end_station_id     <chr> "TA1306000029", "TA1308000019", "13278", "13071", "…
## $ start_lat          <dbl> 41.88314, 41.92915, 41.94823, 41.94823, 41.94823, 4…
## $ start_lng          <dbl> -87.63724, -87.64915, -87.66407, -87.66407, -87.664…
## $ end_lat            <dbl> 41.89259, 41.93974, 41.94553, 41.94374, 41.94374, 4…
## $ end_lng            <dbl> -87.61729, -87.65887, -87.64644, -87.66402, -87.664…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…
## $ ride_length        <dbl> 13.957950, 5.072400, 11.591667, 3.570550, 2.573817,…
## $ month              <ord> January, January, January, January, January, Januar…
## $ day_of_week        <ord> Tuesday, Saturday, Thursday, Thursday, Thursday, Th…

Looks good to me.


Finally, we’ll sort the whole dataset by the started_at date/time.
This is optional, as later analyses don’t require sorted data, but it’s helpful to know the trips are now sorted in chronological order.

tripdata_2025_cleaned <- tripdata_2025_cleaned %>% 
  arrange(started_at)


Data Verification and Integrity

Before beginning analysis, we verified the integrity of the dataset, checked for missing or invalid values, and assessed its reliability, objectivity, and potential biases.


Data integrity

  • The dataset was checked for missing and incorrect values
  • 29 rows where the trip start and end times were invalid were removed
  • 5,535 rows had missing end station latitude and longitude, which should be negligible if used in analysis

Bias and credibility

  • This data includes trips recorded by Cyclistic’s system
  • Demographic information, such as age or gender, is not included
  • There is no unique rider identification included
  • The dataset is large enough that any outliers are unlikely to significantly bias overall trends

ROCCC check

  • Reliable: The data comes from Cyclistic’s tracking system
  • Objective: Recording is automated, not self-reported
  • Comprehensive: Covers all trips that took place in 2025
  • Current: Data is from the most recent full year available at the time of analysis
  • Cited: The dataset is publicly available from Motivate International, and cited accordingly



4. Data Analysis

4.1 Rider Type

To understand overall usage, we first look at the number of rides taken by each rider type.

tripdata_2025_cleaned %>%
  count(member_casual) %>%
  mutate(n = format(n, big.mark = ",")) %>%
  kable(
    col.names = c("Rider Type", "Number of Rides")
    ) %>%
  kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("bordered"))
Rider Type    Number of Rides
casual        1,999,488
member        3,553,477

As we noted before, there are roughly 78% more member rides than casual rides.


To visualize that in a quick bar chart:

# Let's store hex colors in a vector for casual and member to reuse in other charts
ride_type_colors <- c("member" = "#619CFF", "casual" = "#F8766D")

tripdata_2025_cleaned %>%
  count(member_casual) %>%
  ggplot(aes(x = member_casual, y = n, fill = member_casual)) +
  geom_col(width = 0.4, show.legend = FALSE) +
  geom_text(aes(label = scales::comma(n)), vjust = -0.6, size = 4) +
  scale_fill_manual(values = ride_type_colors) +
  scale_y_continuous(labels = scales::comma, expand = expansion(mult = c(0, 0.2))) +
  labs(
    x = "Rider Type",
    y = "Number of Rides"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(size = 12),
    axis.title.x = element_text(size = 11, margin = margin(t = 8)),
    axis.title.y = element_text(size = 11, margin = margin(r = 8))
  )



4.2 Time of Year

We begin examining when rides occurred by looking at monthly ride counts. While it is reasonable to expect that warmer months will have more rides overall, our focus here is on identifying any differences in seasonal patterns between rider types.

monthly_counts <- tripdata_2025_cleaned %>%
  count(month, member_casual) %>%
  pivot_wider(
    names_from = member_casual,
    values_from = n
  ) %>%
  arrange(month)

monthly_counts %>%
  mutate(
    casual = format(casual, big.mark = ","),
    member = format(member, big.mark = ",")
  ) %>%
  kable(
    col.names = c("Month", "Casual Rides", "Member Rides")
    ) %>% 
  kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("striped", "bordered"))
Month        Casual Rides    Member Rides
January      24,124          114,527
February     27,757          124,144
March        85,862          212,268
April        109,239         262,137
May          182,770         319,845
June         292,006         386,795
July         323,352         440,106
August       337,878         452,439
September    265,268         449,294
October      224,038         422,058
November     99,082          257,401
December     28,112          112,463

This table lets us compare the count for each month of the year, but since we know member rides occur far more frequently overall, it’s difficult to directly compare the two groups.


To address this, we’ll recreate the table using monthly percentages, calculated as the proportion of each rider type’s total rides in each month. This normalizes the data and provides a clearer basis for comparison.

monthly_counts %>%
  mutate(
    casual = paste0(format(casual, big.mark = ","), " (*", sprintf("%.1f", 100 * casual / sum(casual)), "%*)"),
    member = paste0(format(member, big.mark = ","), " (*", sprintf("%.1f", 100 * member / sum(member)), "%*)")
  ) %>%
  select(month, casual, member) %>%
  kable(
    col.names = c("Month", "Casual Rides", "Member Rides")
    ) %>% 
  kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("striped", "bordered"))
Month        Casual Rides       Member Rides
January      24,124 (1.2%)      114,527 (3.2%)
February     27,757 (1.4%)      124,144 (3.5%)
March        85,862 (4.3%)      212,268 (6.0%)
April        109,239 (5.5%)     262,137 (7.4%)
May          182,770 (9.1%)     319,845 (9.0%)
June         292,006 (14.6%)    386,795 (10.9%)
July         323,352 (16.2%)    440,106 (12.4%)
August       337,878 (16.9%)    452,439 (12.7%)
September    265,268 (13.3%)    449,294 (12.6%)
October      224,038 (11.2%)    422,058 (11.9%)
November     99,082 (5.0%)      257,401 (7.2%)
December     28,112 (1.4%)      112,463 (3.2%)

Using percentages gives us a better way to compare and we can quickly see that there are some obvious differences between the percentages of casual and member rides for some months.
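The normalization itself is simple: each month’s count is divided by that rider type’s yearly total. A minimal base R sketch using the casual counts from the table:

```r
# Monthly casual ride counts, copied from the table above
casual <- c(
  January = 24124, February = 27757, March = 85862, April = 109239,
  May = 182770, June = 292006, July = 323352, August = 337878,
  September = 265268, October = 224038, November = 99082, December = 28112
)

# Each month's share of the yearly casual total, as a percentage
casual_pct <- round(100 * casual / sum(casual), 1)
print(casual_pct)
```

Because each rider type is divided by its own total, the two percentage columns each sum to roughly 100 and can be compared month by month despite the difference in overall volume.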


Let’s plot the monthly percentages on a grouped bar chart to give us an even easier way to spot any trends between rider types.

month_pct <- tripdata_2025_cleaned %>%
  count(month, member_casual) %>%
  group_by(member_casual) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup()

ggplot(month_pct, aes(x = month, y = pct, fill = member_casual)) +
  geom_col(position = position_dodge(width = 0.8), width = 0.7) +
  scale_y_continuous(
    labels = scales::percent_format(accuracy = 1),
    expand = expansion(mult = c(0, 0.05))
  ) +
  scale_fill_manual(values = ride_type_colors) +
  labs(
    x = "Month",
    y = "Percentage of Rides",
    fill = NULL
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(size = 11),
    axis.text.y = element_text(size = 10),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10)),
    legend.position = "inside",
    legend.position.inside = c(0.075, 0.80),
    legend.justification = c(0, 1),
    legend.direction = "vertical",
    legend.text = element_text(size = 11),
    legend.key.spacing.y = unit(8, "pt"),
    legend.background = element_rect(
      fill = scales::alpha("white", 0.8),
      color = NA
    ),
    panel.grid.minor = element_blank()
  )

The chart shows clear seasonal differences in rider behavior. A larger proportion of casual rides occur during the summer months of June, July, and August, while a larger proportion of member rides occur during the winter months of December, January, and February.


To better summarize these seasonal patterns, we can group months into broader season periods and compare the proportion of rides taken by each rider type.

tripdata_2025_cleaned <- tripdata_2025_cleaned %>%
  mutate(
    season = case_when(
      month %in% c("December", "January", "February") ~ "Winter",
      month %in% c("March", "April", "May") ~ "Spring",
      month %in% c("June", "July", "August") ~ "Summer",
      month %in% c("September", "October", "November") ~ "Fall"
    ),
    season = factor(season, levels = c("Winter", "Spring", "Summer", "Fall"))
  )

# Pivot wider and create kable
tripdata_2025_cleaned %>%
  count(season, member_casual) %>%
  pivot_wider(
    names_from = member_casual,
    values_from = n
  ) %>%
  arrange(season) %>%
  mutate(
    casual = paste0(format(casual, big.mark = ","), " (*", sprintf("%.1f", 100 * casual / sum(casual)), "%*)"),
    member = paste0(format(member, big.mark = ","), " (*", sprintf("%.1f", 100 * member / sum(member)), "%*)")
  ) %>%
  select(season, casual, member) %>%
  kable(
    col.names = c("Season", "Casual Rides", "Member Rides")
  ) %>% 
  kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("striped", "bordered"))
Season    Casual Rides       Member Rides
Winter    79,993 (4.0%)      351,134 (9.9%)
Spring    377,871 (18.9%)    794,250 (22.4%)
Summer    953,236 (47.7%)    1,279,340 (36.0%)
Fall      588,388 (29.4%)    1,128,753 (31.8%)

This makes it very clear how big a difference there is in what time of year the different groups ride. The percentages of summer and winter rides show the largest differences. While the winter share of member rides (9.9%) is more than double that of casual rides (4.0%), winter still represents a relatively small proportion of rides compared to summer, which accounts for nearly half of all casual rides and over a third of member rides.
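The winter and summer shares can be recomputed directly from the counts in the table, which makes it easy to check which rider type has the larger winter share and by what factor.

```r
# Seasonal counts and yearly totals, copied from the tables above
winter <- c(casual = 79993, member = 351134)
summer <- c(casual = 953236, member = 1279340)
totals <- c(casual = 1999488, member = 3553477)

winter_share <- 100 * winter / totals
summer_share <- 100 * summer / totals

# Member winter share divided by casual winter share
winter_ratio <- winter_share[["member"]] / winter_share[["casual"]]

print(round(winter_share, 1))
print(round(summer_share, 1))
print(round(winter_ratio, 2))
```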


4.3 Day of the Week

Next, we check how many rides occur on each day of the week, comparing member vs casual.
Hypothesis: Member rides will be more frequent on weekdays, while casual rides will peak on weekends. This would reflect members using the bikes for their daily commute, while casual riders use them more for recreation. Since the total counts are so different, we’ll include percentages.

day_of_week_counts <- tripdata_2025_cleaned %>%
  count(day_of_week, member_casual) %>%
  pivot_wider(
    names_from = member_casual,
    values_from = n
  ) %>%
  arrange(day_of_week)

day_of_week_counts %>%
  mutate(
    casual = paste0(format(casual, big.mark = ","), " (*", sprintf("%.1f", 100 * casual / sum(casual)), "%*)"),
    member = paste0(format(member, big.mark = ","), " (*", sprintf("%.1f", 100 * member / sum(member)), "%*)")
  ) %>%
  select(day_of_week, casual, member) %>%
  kable(
    col.names = c("Day of the Week", "Casual Rides", "Member Rides")
    ) %>% 
  kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("striped", "bordered"))
Day of the Week    Casual Rides       Member Rides
Sunday             331,783 (16.6%)    382,293 (10.8%)
Monday             228,253 (11.4%)    502,767 (14.1%)
Tuesday            225,586 (11.3%)    563,070 (15.8%)
Wednesday          221,538 (11.1%)    550,154 (15.5%)
Thursday           258,045 (12.9%)    576,005 (16.2%)
Friday             320,077 (16.0%)    528,989 (14.9%)
Saturday           414,206 (20.7%)    450,199 (12.7%)

We can quickly see that a much higher percentage of casual rides are on the weekends, whereas member rides tend to be during the week.
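To put a number on the weekend effect, we can add up the Saturday and Sunday counts from the table for each rider type. This is a base R sketch using the published counts, not part of the main pipeline.

```r
days <- c("Sunday", "Monday", "Tuesday", "Wednesday",
          "Thursday", "Friday", "Saturday")

# Ride counts by day of week, copied from the table above
casual <- setNames(c(331783, 228253, 225586, 221538, 258045, 320077, 414206), days)
member <- setNames(c(382293, 502767, 563070, 550154, 576005, 528989, 450199), days)

weekend <- c("Saturday", "Sunday")

# Share of each rider type's rides that fall on Saturday or Sunday
casual_weekend_pct <- 100 * sum(casual[weekend]) / sum(casual)
member_weekend_pct <- 100 * sum(member[weekend]) / sum(member)

print(round(c(casual = casual_weekend_pct, member = member_weekend_pct), 1))
```

If rides were spread evenly across the week, two days would account for about 28.6% of rides, so casual rides sit well above that baseline and member rides below it.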


We can plot these day of the week percentages on a line chart to compare rider types.

# Calculate percentages and store in a new tibble for later use
day_of_week_percent <- tripdata_2025_cleaned %>%
  count(day_of_week, member_casual) %>%
  group_by(member_casual) %>%
  mutate(percentage = n / sum(n) * 100) %>%
  ungroup()

last_points <- day_of_week_percent %>%
  group_by(member_casual) %>%
  filter(day_of_week == max(day_of_week))

# Create a line chart using the stored variable
ggplot(day_of_week_percent, aes(x = day_of_week, y = percentage, color = member_casual, group = member_casual)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  scale_color_manual(values = ride_type_colors) +
  scale_y_continuous(limits = c(10, 21), breaks = seq(10, 21, 2)) +
  labs(title = "Percentage of Rides by Day of Week and Rider Type",
       x = NULL, y = "Percentage of Rides", color = "Rider Type") +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.text.x = element_text(size = 11),
    axis.title.y = element_text(margin = margin(r = 8))
  ) +
  geom_text(
    data = last_points,
    aes(label = member_casual),
    hjust = -0.2,
    vjust = 0.22,
    size = 5
  ) +
  coord_cartesian(clip = "off")

This clearly shows how different the curves are.
Casual rides are more prevalent on weekends, dipping lower mid-week.
Member rides are lowest on the weekends, with the most rides occurring on Tuesday, Wednesday, and Thursday.
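One implementation note on the chart: the end-of-line labels rely on taking the maximum of day_of_week, which only works because it is an ordered factor. A small base R sketch of that behavior, with a made-up vector of days:

```r
day_levels <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                "Thursday", "Friday", "Saturday")

# Hypothetical sample of days, stored as an ordered factor (Sunday first)
day_of_week <- factor(
  c("Monday", "Saturday", "Sunday", "Wednesday"),
  levels = day_levels,
  ordered = TRUE
)

# max() follows the level order, not alphabetical order
last_day <- max(day_of_week)
print(as.character(last_day))
```

With Sunday as the first level, the last point on the x-axis is Saturday, which is where the rider type labels are placed.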


4.4 Time of Day

We’ll now look for any differences in the time of day between casual and member rides. We’ll use the started_at date/time and break the hours of the day into eight three-hour groups. Eight buckets split the day into enough distinct time periods without cluttering the analysis.
Hypothesis: Member rides will tend to be most frequent during rush hour times (7am-10am and 3pm-6pm), while casual rides will be higher in the middle of the day. This again would reflect members using the bikes for their commute and casual riders using them for recreation.

tripdata_2025_cleaned <- tripdata_2025_cleaned %>%
  mutate(
    # Extract the ride starting hour (0-23)
    start_hour = hour(started_at),
    
    # Create the time intervals
    start_time_bucket = cut(
      start_hour,
      breaks = seq(0, 24, by = 3),
      right = FALSE,
      labels = c("Midnight-3am", "3am-6am", "6am-9am", "9am-Noon", "Noon-3pm", "3pm-6pm", "6pm-9pm", "9pm-Midnight")
    )
  )

# Pivot wider and create kable
tripdata_2025_cleaned %>%
  count(start_time_bucket, member_casual) %>%
  pivot_wider(
    names_from = member_casual,
    values_from = n
  ) %>%
  arrange(start_time_bucket) %>%
  mutate(
    casual = paste0(format(casual, big.mark = ","), " (*", sprintf("%.1f", 100 * casual / sum(casual)), "%*)"),
    member = paste0(format(member, big.mark = ","), " (*", sprintf("%.1f", 100 * member / sum(member)), "%*)")
  ) %>%
  select(start_time_bucket, casual, member) %>%
  kable(
    col.names = c("Time of Day", "Casual Rides", "Member Rides")
  ) %>% 
  kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("striped", "bordered"))
Time of Day     Casual Rides       Member Rides
Midnight-3am    80,722 (4.0%)      64,470 (1.8%)
3am-6am         27,987 (1.4%)      51,043 (1.4%)
6am-9am         146,638 (7.3%)     557,280 (15.7%)
9am-Noon        266,031 (13.3%)    479,728 (13.5%)
Noon-3pm        399,600 (20.0%)    564,248 (15.9%)
3pm-6pm         520,540 (26.0%)    953,078 (26.8%)
6pm-9pm         373,521 (18.7%)    642,121 (18.1%)
9pm-Midnight    184,449 (9.2%)     241,509 (6.8%)

We do see some differences here, but only in a few time periods: notably 6am-9am, where the member share is much higher, and noon-3pm, where the casual share is much higher. Most of the other periods are very close.
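For reference, here is how the three-hour bucketing behaves at the boundaries. With right = FALSE, each interval includes its lower bound and excludes its upper bound, so a ride starting at exactly 9:00 lands in "9am-Noon" rather than "6am-9am". The sample hours below are made up for illustration.

```r
bucket_labels <- c("Midnight-3am", "3am-6am", "6am-9am", "9am-Noon",
                   "Noon-3pm", "3pm-6pm", "6pm-9pm", "9pm-Midnight")

# Hypothetical start hours (0-23), chosen to sit on or near bucket edges
start_hour <- c(0, 2, 3, 8, 9, 23)

# Intervals are [0,3), [3,6), ..., [21,24)
start_time_bucket <- cut(
  start_hour,
  breaks = seq(0, 24, by = 3),
  right = FALSE,
  labels = bucket_labels
)

print(data.frame(start_hour, start_time_bucket))
```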


Let’s plot this out on another grouped bar chart to visualize where the differences are.

time_of_day_pct <- tripdata_2025_cleaned %>%
  count(start_time_bucket, member_casual) %>%
  group_by(member_casual) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup()

ggplot(time_of_day_pct, aes(x = start_time_bucket, y = pct, fill = member_casual)) +
  geom_col(position = position_dodge(width = 0.8), width = 0.7) +
  scale_y_continuous(
    labels = scales::percent_format(accuracy = 1),
    expand = expansion(mult = c(0, 0.05))
  ) +
  scale_fill_manual(values = ride_type_colors) +
  labs(
    x = "Time of Day",
    y = "Percentage of Rides",
    fill = NULL
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(size = 11),
    axis.text.y = element_text(size = 10),
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10)),
    legend.position = "inside",
    legend.position.inside = c(0.075, 0.80),
    legend.justification = c(0, 1),
    legend.direction = "vertical",
    legend.text = element_text(size = 11),
    legend.key.spacing.y = unit(8, "pt"),
    legend.background = element_rect(
      fill = scales::alpha("white", 0.8),
      color = NA
    ),
    panel.grid.minor = element_blank()
  )

We can see from the bar chart that several time periods have very similar ride distributions between rider types: 3am-6am, 9am-noon, 3pm-6pm, and 6pm-9pm.

Casual rides make up a higher percentage from midnight-3am, noon-3pm, and 9pm-midnight, suggesting greater use outside of traditional commuter hours.

In contrast, the percentage of member rides from 6am-9am is more than double that of casual rides, strongly suggesting commuter-driven usage during morning rush hours.

Overall, this partially supports our hypothesis. The 6am-9am window aligns with typical morning commute hours, while late-night and midday use would be more consistent with recreational rides. The 3pm-6pm window shows similarly high usage for both rider types, which may reflect an overlap of the evening commute with recreational riding.


4.5 Ride Length

To start analyzing the ride length (duration) for each rider type, let’s look at the mean, median, min, and max.

tripdata_2025_cleaned %>%
  group_by(member_casual) %>%
  summarize(
    mean_ride_length = mean(ride_length),
    median_ride_length = median(ride_length),
    min_ride_length = min(ride_length),
    max_ride_length = max(ride_length)
  ) %>%
  kable(
    digits = 2,
    format.args = list(big.mark = ",", nsmall = 2),
    col.names = c("Rider Type", "Mean (mins)", "Median (mins)", "Min (mins)", "Max (mins)")
  ) %>%
  kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("bordered"))
Rider Type    Mean (mins)    Median (mins)    Min (mins)    Max (mins)
casual        22.60          11.41            0.00          1,574.90
member        12.33          8.58             0.00          1,499.97

Casual rides are longer: the mean is nearly double that of member rides (22.60 vs 12.33 minutes), while the median is about a third higher (11.41 vs 8.58 minutes).
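The gap between the mean and the median in both groups is what we would expect from a distribution with a long right tail. A tiny sketch with made-up ride lengths shows how a single very long ride pulls the mean up while leaving the median untouched:

```r
# Hypothetical ride lengths in minutes; the last one is an extreme outlier
ride_length <- c(5, 8, 10, 12, 600)

mean_length <- mean(ride_length)
median_length <- median(ride_length)

print(c(mean = mean_length, median = median_length))
```

This is why the median is the better summary of a "typical" ride here.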


We can take a look at the ride length data on a density plot here. We limit the x-axis to 50 minutes to get a better view of the highest-density areas, because the long right tail would otherwise extend past 1,500 minutes.

tripdata_2025_cleaned %>%
  ggplot(aes(x = ride_length, fill = member_casual)) +
  geom_density(alpha = 0.55) +
  scale_x_continuous(limits = c(0, 50), name = "Ride Length (mins)") +
  ylab("Density") +
  ggtitle("Ride Length Distribution by Rider Type") +
  scale_fill_manual(values = ride_type_colors) +
  theme_minimal() +
  theme(
    legend.position = "inside",
    legend.position.inside = c(0.65, 0.65),
    legend.key.spacing.y = unit(0.25, "lines"),
    legend.background = element_rect(linewidth = 0.7),
    legend.title = element_blank(),
    axis.title.x = element_text(margin = margin(t = 8)),
    axis.title.y = element_text(margin = margin(r = 8))
  )

This shows us that members tend to take shorter rides, as the curve peaks higher and slightly earlier.
Casual riders are more likely to take longer rides, as the casual curve has a heavier right tail.
Both rider types exhibit similarly shaped distributions, with notable overlap between 5 and 10 minutes.


4.6 Rideable Type

Since we have two types of bikes in our data, we can compare which bike type was preferred on member and casual rides. Again, since the totals for each rider type are so different, we’ll add percentages to make the comparison easier.

rideable_counts <- tripdata_2025_cleaned %>%
  mutate(rideable_type = recode(
    rideable_type,
    "classic_bike" = "Classic Bike",
    "electric_bike" = "Electric Bike"
  )) %>%
  count(rideable_type, member_casual) %>%
  pivot_wider(
    names_from = member_casual,
    values_from = n
  )

rideable_counts %>%
  mutate(
    casual = paste0(format(casual, big.mark = ","), " (*", sprintf("%.1f", 100 * casual / sum(casual)), "%*)"),
    member = paste0(format(member, big.mark = ","), " (*", sprintf("%.1f", 100 * member / sum(member)), "%*)")
  ) %>%
  select(rideable_type, casual, member) %>%
  kable(
    col.names = c("Rideable Type", "Casual Rides", "Member Rides")
  ) %>%
  kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("striped", "bordered"))
Rideable Type    Casual Rides         Member Rides
Classic Bike     672,670 (33.6%)      1,275,359 (35.9%)
Electric Bike    1,326,818 (66.4%)    2,278,118 (64.1%)

Both casual and member riders show a similar distribution of bike types, favoring electric bikes by a wide margin (roughly two-thirds of rides). While casual riders have a slightly higher proportion of rides on electric bikes, the difference is minimal (~2.3 percentage points) and unlikely to have a meaningful impact on our analysis.
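It’s worth separating two ways of expressing this gap: the difference in percentage points between the two shares, and the relative difference with the member share as the baseline. A short sketch using the counts from the table:

```r
# Electric bike ride counts and yearly totals, copied from the tables above
electric <- c(casual = 1326818, member = 2278118)
totals   <- c(casual = 1999488, member = 3553477)

electric_share <- 100 * electric / totals

# Gap in percentage points between the two shares
pp_gap <- electric_share[["casual"]] - electric_share[["member"]]

# Same gap expressed relative to the member share
relative_gap <- 100 * pp_gap / electric_share[["member"]]

print(round(c(percentage_points = pp_gap, relative_percent = relative_gap), 2))
```

Computed from the unrounded shares, the gap is a little over 2 percentage points, or about 3.5% in relative terms. Either way it is small.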



4.7 Geographic Patterns in Bike Usage

While our earlier analyses explored temporal differences between casual and member riders, we can also examine where rides begin and end. By mapping each station and coloring it based on the proportion of casual vs member rides, we can see how usage patterns vary across the city.

station_usage <- tripdata_2025_cleaned %>%
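  # count() on ungrouped data returns an ungrouped tibble, so rescale() below
  # is applied across all stations at once rather than within one-row groups
  count(start_station_name, start_lat, start_lng, member_casual) %>%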
  pivot_wider(names_from = member_casual, values_from = n, values_fill = 0) %>%
  mutate(
    total = casual + member,
    pct_casual = casual / total,
    pct_member = member / total,
    start_station_name = ifelse(is.na(start_station_name), "", start_station_name),

    # log-based marker size (min 3px, max 20px)
    marker_radius = scales::rescale(log1p(total), to = c(3, 20))
  )

library(leaflet)

pal <- colorNumeric(
  palette = colorRampPalette(c(ride_type_colors["member"], ride_type_colors["casual"]))(20),
  domain = station_usage$pct_casual
)

leaflet(station_usage) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addCircleMarkers(
    lng = ~start_lng,
    lat = ~start_lat,
    radius = ~marker_radius,
    color = ~pal(pct_casual),
    stroke = FALSE,
    fillOpacity = 0.85,
    label = ~lapply(
      paste0(
        ifelse(start_station_name == "" | is.na(start_station_name), "", paste0(start_station_name, "<br>")),
        "Total rides: ", total, "<br>",
        "Casual: ", sprintf("%.1f%%", 100 * pct_casual), "<br>",
        "Member: ", sprintf("%.1f%%", 100 * pct_member)
      ), 
      htmltools::HTML
    )
  ) %>%
  addLegend(
    pal = pal,
    values = ~pct_casual,
    title = "% Casual Rides",
    labFormat = labelFormat(suffix = "%", transform = function(x) 100 * x)
  )

We can see some clear patterns here. Blue stations, where member rides dominate, are concentrated in “The Loop” and the adjacent “North Side”, following the train lines. Red stations, where casual rides dominate, are located around the fringes, particularly in the south and northwest. There’s also a thin string of red along the lakeshore, following the “Lakefront Trail”, which is a popular leisure and tourist path but is also commonly used by commuters.
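The log-based marker sizing used for the map can be sketched in base R as follows. The helper name rescale_log and the sample totals are hypothetical, and returning the midpoint for a zero range is my own choice for this sketch. The key point is that the rescaling needs to see all stations at once, because a single value has no range to scale against.

```r
# Hypothetical helper: map ride totals onto marker radii on a log scale
rescale_log <- function(total, to = c(3, 20)) {
  x <- log1p(total)
  rng <- range(x)
  # With no spread (e.g. a single station), fall back to the midpoint
  if (rng[1] == rng[2]) {
    return(rep(mean(to), length(x)))
  }
  to[1] + (x - rng[1]) / (rng[2] - rng[1]) * (to[2] - to[1])
}

# Made-up station totals spanning several orders of magnitude
station_totals <- c(1, 50, 2500, 40000)
radius <- rescale_log(station_totals)

print(round(radius, 1))
```

The log transform keeps the busiest stations from dwarfing everything else on the map while still preserving the ordering by ride volume.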


5. Conclusions & Recommendations

In this section, we’ll review the analyses we’ve done so far, state our conclusions, and make some recommendations for further analysis and what the company can do to try to convert riders from casual to annual members.

5.1 Comparison of Differences

Let’s compare several key behaviors we examined to identify where the largest differences occur between casual rides and annual member rides.

We’ll use:

  • the percentage of winter rides
    This compares the largest seasonal difference and helps test the commuter hypothesis, since we can theorize that recreational riders are more likely to ride primarily during the warmer months, while commuters following a routine are more likely to continue riding through colder temperatures

  • the percentage of weekend (Friday, Saturday, Sunday) vs weekday rides
    This compares the biggest difference in day of the week and checks our commuter hypothesis

  • the percentage of rides during morning commute hours (6am-9am)
    This compares the time period with the biggest difference and checks our commuter hypothesis

  • the median (typical) ride length
    We use the median here to include a broader spectrum of ‘typical’ rides without being affected as much by the long tail

  • the percentage of electric bikes vs classic bikes
    Although we know the difference here is minimal, it’s one of our tracked statistics in the data and useful to compare, and made easier because there are only two rideable types recorded

final_comparison <- tripdata_2025_cleaned %>%
  group_by(member_casual) %>%
  summarise(
    "Winter rides (Dec-Feb)" = mean(season == "Winter") * 100,
    "Weekend rides (Fri–Sun)" = mean(day_of_week %in% c("Friday", "Saturday", "Sunday")) * 100,
    "Morning commute (6am–9am)" = mean(start_hour >= 6 & start_hour < 9) * 100,
    "Typical ride length (median)" = median(ride_length, na.rm = TRUE),
    "Electric bike rides" = mean(rideable_type == "electric_bike") * 100,
    .groups = "drop"
  ) %>%
  pivot_longer(cols = -member_casual, names_to = "Metric", values_to = "Value") %>%
  pivot_wider(names_from = member_casual, values_from = Value) %>%
  mutate(
    # Absolute percent difference relative to the member value: |casual - member| / member * 100
    Percent_Diff = abs((casual - member) / member) * 100
  ) %>%
  # Sort by the largest Percent Difference
  arrange(desc(Percent_Diff))

final_comparison %>%
  mutate(
    # Add labels for each value
    Casual_Display = if_else(
      str_detect(Metric, "length"),
      paste0(sprintf("%.1f", casual), " mins"),
      paste0(sprintf("%.1f", casual), "%")
    ),
    Member_Display = if_else(
      str_detect(Metric, "length"),
      paste0(sprintf("%.1f", member), " mins"),
      paste0(sprintf("%.1f", member), "%")
    ),
    Percent_Diff = paste0(sprintf("%.1f", Percent_Diff), "%")
  ) %>%
  select(Metric, Casual_Display, Member_Display, Percent_Diff) %>%
  kable(col.names = c("Metric", "Casual Rides", "Member Rides", "Percent Difference")) %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "bordered"), position = "left")

Metric                        Casual Rides  Member Rides  Percent Difference
Winter rides (Dec-Feb)        4.0%          9.9%          59.5%
Morning commute (6am–9am)     7.3%          15.7%         53.2%
Weekend rides (Fri–Sun)       53.3%         38.3%         39.2%
Typical ride length (median)  11.4 mins     8.6 mins      33.0%
Electric bike rides           66.4%         64.1%         3.5%

Note: We calculate Percent Difference as the absolute value of ((Casual - Member) / Member) * 100. This gives us a way to directly compare the different types of factors on the same relative scale.
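
As a quick check on the formula in the note, here is a minimal sketch that applies it to two rows of the table. The helper name `percent_diff` is hypothetical and is not part of the analysis code. The inputs are the rounded values shown in the table, so a result can differ from the table in the last digit, because the table was computed from unrounded values.

```r
# Hypothetical helper: absolute difference relative to the member value, in percent.
percent_diff <- function(casual, member) {
  abs((casual - member) / member) * 100
}

# Rounded inputs copied from the comparison table.
weekend  <- percent_diff(casual = 53.3, member = 38.3)
electric <- percent_diff(casual = 66.4, member = 64.1)

round(weekend, 1)   # 39.2, matching the table
round(electric, 1)  # 3.6 from rounded inputs; the table reports 3.5 from unrounded data

# The result depends on which group is the baseline.
# Dividing by the casual value instead gives a smaller figure for weekend rides.
weekend_casual_base <- abs((38.3 - 53.3) / 53.3) * 100
round(weekend_casual_base, 1)  # 28.1
```

Because the member value is the denominator, the ranking in the table reflects differences relative to member behavior. Using the casual value as the baseline would change the magnitudes and could change the order.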

We can conclude that the percentage of rides in the winter represents the single largest behavioral difference between casual and member riders, with the 6am-9am morning commute not far behind. Weekend vs weekday riding and ride length also show substantial differences. There was very little variation in which bike type was used between the two groups.


5.2 Summary of Key Differences

The most pronounced behavioral differences between casual and annual member riders occur in when and for how long bikes were used, rather than what bike was used.

Member rides show a markedly higher proportion during weekday travel and during early morning hours (6am-9am) compared to casual rides. While these factors were analyzed independently, their combined pattern is consistent with commuter travel, supporting the hypothesis that annual members are more likely to use Cyclistic bikes for work-related travel. In contrast, casual rides are more concentrated on weekends and during the midday or late evening hours, aligning more closely with recreational travel.

Seasonally, a much higher percentage of annual member rides took place during the winter months. This is consistent with commuters continuing to ride as part of their routine through the coldest months of the year, while recreational riders are much less likely to ride through the winter. As we saw earlier, however, winter accounts for a small proportion of rides for both groups. Summer rides, which still exhibited a sizable difference, accounted for nearly half of casual rides and over a third of member rides. This further supports the idea of casual riders preferring warmer times of the year for recreational rides.

Differences in ride length were somewhat less substantial, while still showing casual rides to be longer on average. Rides longer than 12 minutes, especially, were much more likely to be casual, which may further suggest recreational or leisure use.

Finally, bike type preference showed minimal variation between the two groups. Both casual and member riders strongly favored electric bikes, indicating that rideable type is not a strong differentiator between rider categories.


5.3 Recommendations

Based on our analysis of the 2025 Cyclistic ride data, we can conclude that the most meaningful differences between casual and annual member rides were observed in time of year, day of week, time of day, and ride length, while the type of bike used showed little difference between the groups. Casual rides occurred more frequently on weekends, during midday and late evening hours, in the warmer months, and were more likely to be longer trips.

To support the marketing team’s goal of converting casual riders into annual members, we recommend offering targeted incentives that encourage casual riders to take shorter rides during weekday commuter hours. These incentives could take the form of discounts sent to single-ride and full-day pass customers. These discounts would be specifically for rides taken on weekdays, during morning or evening commuter hours, and/or of a shorter duration. Additionally, discounts could be distributed during or targeted for the cooler seasons.

The goal of this strategic initiative would be to encourage casual riders to try out using Cyclistic bikes for work-related travel. By repeatedly offering these discounted prices, casual riders may discover that commuting by bike is both feasible and beneficial, increasing the likelihood that it becomes part of their regular routine. Over time, follow-up messaging could introduce annual membership options, once habitual usage is established.

Because bike type preference differed minimally between casual and member riders, marketing efforts focused on rideable type are unlikely to meaningfully influence conversion and should be deprioritized in favor of time- and behavior-based strategies. It should be noted for other initiatives, however, that electric bikes were strongly preferred by both groups.

Finally, the geographic distribution of rides further supports a commuter-focused strategy. While “The Loop” and “North Side” currently see more member use, the presence of casual rides in these high-density interior zones suggests an opportunity to convert some of these riders to members. We recommend targeting casual riders at these interior stations with incentives for weekday, peak-hour usage, encouraging them to incorporate Cyclistic into their daily routines.


5.4 Suggestions for Further Analysis

One significant limitation of the current data is that there is no rider identification, so it isn’t currently possible to analyze rider-level trends. With unique rider IDs, we would be able to further test the commuter theory (especially by combining them with day of the week, time of day, and station IDs). We could also see how often riders use the service and determine whether something like the number of trips per day/week/month is another differentiating factor.

Along with the lack of unique rider IDs, there is no demographic data, such as age or gender of riders. This would be very helpful in further analysis and especially in marketing recommendations.

Another area for further analysis would be to separate weekday rides from weekend rides and then examine whether that split affects the other factors, like time of day and ride length. This could help show whether the time of day differences were, in fact, driven by commuting behavior.
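
A minimal sketch of how that weekday/weekend split could start is shown here, using the column names from the comparison code above (`member_casual`, `day_of_week`, `start_hour`). The `toy_rides` data and its values are invented purely for illustration, and `day_type` is a hypothetical new column. Note that this sketch treats only Saturday and Sunday as the weekend, which is an assumption that differs from the Friday–Sunday grouping used in the comparison table.

```r
library(dplyr)

# Toy stand-in for tripdata_2025_cleaned; values are invented for illustration only.
toy_rides <- tibble(
  member_casual = c("member", "member", "member", "member",
                    "casual", "casual", "casual", "casual"),
  day_of_week   = c("Monday", "Tuesday", "Saturday", "Sunday",
                    "Monday", "Wednesday", "Saturday", "Sunday"),
  start_hour    = c(7, 8, 11, 7, 14, 8, 13, 16)
)

commute_by_day_type <- toy_rides %>%
  mutate(
    # Assumption: weekend means Saturday and Sunday only.
    day_type = if_else(day_of_week %in% c("Saturday", "Sunday"), "Weekend", "Weekday")
  ) %>%
  group_by(member_casual, day_type) %>%
  summarise(
    rides = n(),
    # Share of rides starting in the 6am-9am window within each group
    morning_commute_pct = mean(start_hour >= 6 & start_hour < 9) * 100,
    .groups = "drop"
  )

commute_by_day_type
```

If the morning peak for member rides largely disappears on weekends in the real data, that would be stronger evidence that the time of day difference comes from commuting rather than from a general preference for early rides.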

Additionally, station pattern analysis, with or without unique rider IDs, could prove useful, especially in determining areas with heavy weekday morning usage or for tracking commuter patterns. It could show which stations are used more often by each rider group. Combined with unique rider IDs, it could show whether member riders are more likely to repeat the same trips.

Future analysis should also involve segmenting station maps by day of the week or time of day. This would determine if the casual patterns in the south and northwest fringes persist during the work week, or if some of these areas see a shift towards member commuter rides.



6. Notes and Citations

The data analyzed in this case study was obtained from Divvy’s publicly available trip data and includes individual monthly CSV files for the year 2025. The dataset was accessed via the Divvy Trip Data repository and is provided for public use.

Data Source: Divvy Bikes. (2025). Divvy Trip Data. Retrieved from https://divvy-tripdata.s3.amazonaws.com/index.html



This case study was completed as part of the Google Data Analytics Professional Certificate program.
All analysis and conclusions reflect the author’s independent work.